We analyze two variables together to find out whether there is an association between them. Even when an association is difficult to establish conclusively, bivariate analysis can reveal signs of association that at least warrant further attention.
Let me use the crime data from the Seattle Open Data portal (I formatted this file previously).
# clear memory
rm(list = ls())
# collecting the data
link="https://github.com/EvansDataScience/data/raw/master/crime.RData"
load(file = url(link)) #loaded as 'crime'
Let’s see what kind of data we have in the crime data table:
# categorical? numerical?
str(crime,width = 50,strict.width='cut')
## 'data.frame': 499698 obs. of 17 variables:
## $ Report.Number : chr "201300002"..
## $ Occurred.Date : Date, format: ...
## $ year : num 2013 2013 2..
## $ month : num 7 7 7 7 7 7..
## $ weekday : Ord.factor w/ 7"..
## $ Occurred.Time : num 1930 1917 1..
## $ Occurred.DayTime : Ord.factor w/ 4"..
## $ Reported.Date : Date, format: ...
## $ Reported.Time : num 1722 2052 3..
## $ DaysToReport : num 1 0 1 1 0 0..
## $ crimecat : Factor w/ 20 le"..
## $ Crime.Subcategory : Factor w/ 30 le"..
## $ Primary.Offense.Description: Factor w/ 144 l"..
## $ Precinct : Factor w/ 5 lev"..
## $ Sector : Factor w/ 23 le"..
## $ Beat : Factor w/ 64 le"..
## $ Neighborhood : Factor w/ 58 le"..
While we use frequency tables in the univariate case, for the bivariate (categorical-categorical) case we prepare contingency tables. Let’s select a couple of categorical variables:
This contingency table shows counts for each combination of levels:
(PrecinctDaytime=table(crime$Precinct,crime$Occurred.DayTime))
##
## day afternoon evening night
## EAST 15976 20774 17380 19880
## NORTH 33744 48754 39867 37942
## SOUTH 17322 22147 16240 15497
## SOUTHWEST 10595 14221 11169 11034
## WEST 30366 48931 33766 30925
When a table is meant to explore a hypothesized relationship, you should place the independent variable in the columns and the dependent one in the rows. Raw counts are hard to interpret, so it is better to use percents. Percents should be computed by column, so you can see how the levels of the dependent variable vary across each level of the independent one (reading along rows):
# computing column percent from contingency table
library(magrittr) # for %>%
(PrecDayti_mgCol=prop.table(PrecinctDaytime,
margin = 2) #2 means by column
%>%round(.,3))
##
## day afternoon evening night
## EAST 0.148 0.134 0.147 0.172
## NORTH 0.312 0.315 0.337 0.329
## SOUTH 0.160 0.143 0.137 0.134
## SOUTHWEST 0.098 0.092 0.094 0.096
## WEST 0.281 0.316 0.285 0.268
The previous table shows you how the crimes occurring in a precinct are distributed across the times of day. So you need a plot that highlights those differences across time.
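The margin argument of prop.table() is easy to verify on a tiny made-up table (the toy object and its numbers below are not from the crime data):

```r
# toy contingency table: 2 groups (rows) x 2 conditions (columns)
toy=as.table(matrix(c(10,30,20,40),nrow = 2,
                    dimnames = list(group=c("A","B"),
                                    cond=c("x","y"))))
# margin = 2: percents by column, so each column sums to 1
colPct=prop.table(toy,margin = 2)
colPct["A","x"] # 10/(10+30) = 0.25
colSums(colPct) # every column sums to 1
```

With margin = 1 the shares would instead sum to 1 along each row.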
As before, we need to turn this table into a data frame:
#making a data frame from contingency table
(PrecDaytiDF=as.data.frame(PrecinctDaytime))
## Var1 Var2 Freq
## 1 EAST day 15976
## 2 NORTH day 33744
## 3 SOUTH day 17322
## 4 SOUTHWEST day 10595
## 5 WEST day 30366
## 6 EAST afternoon 20774
## 7 NORTH afternoon 48754
## 8 SOUTH afternoon 22147
## 9 SOUTHWEST afternoon 14221
## 10 WEST afternoon 48931
## 11 EAST evening 17380
## 12 NORTH evening 39867
## 13 SOUTH evening 16240
## 14 SOUTHWEST evening 11169
## 15 WEST evening 33766
## 16 EAST night 19880
## 17 NORTH night 37942
## 18 SOUTH night 15497
## 19 SOUTHWEST night 11034
## 20 WEST night 30925
We also have the table with marginal percents by column:
as.data.frame(PrecDayti_mgCol)
## Var1 Var2 Freq
## 1 EAST day 0.148
## 2 NORTH day 0.312
## 3 SOUTH day 0.160
## 4 SOUTHWEST day 0.098
## 5 WEST day 0.281
## 6 EAST afternoon 0.134
## 7 NORTH afternoon 0.315
## 8 SOUTH afternoon 0.143
## 9 SOUTHWEST afternoon 0.092
## 10 WEST afternoon 0.316
## 11 EAST evening 0.147
## 12 NORTH evening 0.337
## 13 SOUTH evening 0.137
## 14 SOUTHWEST evening 0.094
## 15 WEST evening 0.285
## 16 EAST night 0.172
## 17 NORTH night 0.329
## 18 SOUTH night 0.134
## 19 SOUTHWEST night 0.096
## 20 WEST night 0.268
We should simply add the last column to the data frame of counts.
PrecDaytiDF$share=as.data.frame(PrecDayti_mgCol)[,3]
PrecDaytiDF
## Var1 Var2 Freq share
## 1 EAST day 15976 0.148
## 2 NORTH day 33744 0.312
## 3 SOUTH day 17322 0.160
## 4 SOUTHWEST day 10595 0.098
## 5 WEST day 30366 0.281
## 6 EAST afternoon 20774 0.134
## 7 NORTH afternoon 48754 0.315
## 8 SOUTH afternoon 22147 0.143
## 9 SOUTHWEST afternoon 14221 0.092
## 10 WEST afternoon 48931 0.316
## 11 EAST evening 17380 0.147
## 12 NORTH evening 39867 0.337
## 13 SOUTH evening 16240 0.137
## 14 SOUTHWEST evening 11169 0.094
## 15 WEST evening 33766 0.285
## 16 EAST night 19880 0.172
## 17 NORTH night 37942 0.329
## 18 SOUTH night 15497 0.134
## 19 SOUTHWEST night 11034 0.096
## 20 WEST night 30925 0.268
We can change the names of the previous data frame:
names(PrecDaytiDF)[1:3]=c("precinct","daytime","counts")
#then
PrecDaytiDF
## precinct daytime counts share
## 1 EAST day 15976 0.148
## 2 NORTH day 33744 0.312
## 3 SOUTH day 17322 0.160
## 4 SOUTHWEST day 10595 0.098
## 5 WEST day 30366 0.281
## 6 EAST afternoon 20774 0.134
## 7 NORTH afternoon 48754 0.315
## 8 SOUTH afternoon 22147 0.143
## 9 SOUTHWEST afternoon 14221 0.092
## 10 WEST afternoon 48931 0.316
## 11 EAST evening 17380 0.147
## 12 NORTH evening 39867 0.337
## 13 SOUTH evening 16240 0.137
## 14 SOUTHWEST evening 11169 0.094
## 15 WEST evening 33766 0.285
## 16 EAST night 19880 0.172
## 17 NORTH night 37942 0.329
## 18 SOUTH night 15497 0.134
## 19 SOUTHWEST night 11034 0.096
## 20 WEST night 30925 0.268
We will use ggplot to represent the contingency table:
library(ggplot2)
base1=ggplot(data=PrecDaytiDF,
aes(x=daytime,
y=share,
fill=precinct)) # fill brings a legend
Then you can try different positions for the bars. First, the dodge style:
barDodge= base1 + geom_bar(stat="identity",
position ='dodge')
barDodge
The second is the stack style:
barStacked = base1 + geom_bar(stat = "identity",
position = 'stack')#default
barStacked
The stacked version will help more than the dodged one, as it better reveals the values in the contingency table:
PrecDayti_mgCol
##
## day afternoon evening night
## EAST 0.148 0.134 0.147 0.172
## NORTH 0.312 0.315 0.337 0.329
## SOUTH 0.160 0.143 0.137 0.134
## SOUTHWEST 0.098 0.092 0.094 0.096
## WEST 0.281 0.316 0.285 0.268
So, let’s continue by adding some elements to this one:
library(scales)
#annotating
barStackedAnn= barStacked + geom_text(size = 5,# check below:
position = position_stack(vjust = 0.5),# center
aes(label=percent(share,accuracy = 0.1)))# percent format
barStackedAnn = barStackedAnn + scale_y_continuous(labels = percent)
barStackedAnn
Since precinct is nominal, and you see some marked differences along the percents, you can reorder the precinct blocks with reorder():
base1=ggplot(data=PrecDaytiDF,
aes(x=daytime, y=share,
fill=reorder(precinct,share))) ## reordering
barStacked = base1 + geom_bar(stat = "identity",
position = 'stack')
barStacked= barStacked + geom_text(size = 5,
position = position_stack(vjust = 0.5),
aes(label=percent(share,accuracy = 0.1)))
barStacked = barStacked + scale_y_continuous(labels = percent)
barStacked
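By default reorder() sorts the levels of its first argument by the mean of the second within each level; a minimal check with made-up values:

```r
pre=factor(c("A","A","B","B","C","C"))
val=c(5,7,1,3,9,11)
# means per level: B = 2, A = 6, C = 10
levels(reorder(pre,val)) # "B" "A" "C"
```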
Let me show you a more complex situation:
# contingency table with many levels:
(CrimeDay=table(crime$crimecat,crime$Occurred.DayTime))
##
## day afternoon evening night
## AGGRAVATED ASSAULT 3564 5366 4884 7501
## ARSON 196 167 191 486
## BURGLARY 24139 22288 14121 16082
## CAR PROWL 26740 38273 42595 34839
## DISORDERLY CONDUCT 41 81 67 79
## DUI 706 939 2038 8522
## FAMILY OFFENSE-NONVIOLENT 1748 2516 1217 1120
## GAMBLE 4 4 7 2
## HOMICIDE 41 46 49 131
## LIQUOR LAW VIOLATION 112 491 410 606
## LOITERING 20 31 25 9
## NARCOTIC 2415 6416 3924 4109
## PORNOGRAPHY 65 53 17 31
## PROSTITUTION 115 675 1425 1340
## RAPE 332 318 354 854
## ROBBERY 2584 4737 4139 5372
## SEX OFFENSE-OTHER 1501 1759 1014 1776
## THEFT 38687 64868 38980 28410
## TRESPASS 4848 5184 2598 3289
## WEAPON 735 1445 947 1624
This contingency table has a categorical variable with many levels; let’s prepare a data frame as before:
#making a data frame from contingency table
CrimeDayDF=as.data.frame(CrimeDay)
#marginal
CrimeDay_mgCol=prop.table(CrimeDay,margin = 2)
#renaming:
names(CrimeDayDF)=c("crime","daytime","counts")
#adding marginal
CrimeDayDF$share=as.data.frame(CrimeDay_mgCol)[,3]
# result for ggplot:
head(CrimeDayDF,20)
## crime daytime counts share
## 1 AGGRAVATED ASSAULT day 3564 3.281980e-02
## 2 ARSON day 196 1.804905e-03
## 3 BURGLARY day 24139 2.222887e-01
## 4 CAR PROWL day 26740 2.462405e-01
## 5 DISORDERLY CONDUCT day 41 3.775566e-04
## 6 DUI day 706 6.501340e-03
## 7 FAMILY OFFENSE-NONVIOLENT day 1748 1.609680e-02
## 8 GAMBLE day 4 3.683479e-05
## 9 HOMICIDE day 41 3.775566e-04
## 10 LIQUOR LAW VIOLATION day 112 1.031374e-03
## 11 LOITERING day 20 1.841739e-04
## 12 NARCOTIC day 2415 2.223900e-02
## 13 PORNOGRAPHY day 65 5.985653e-04
## 14 PROSTITUTION day 115 1.059000e-03
## 15 RAPE day 332 3.057287e-03
## 16 ROBBERY day 2584 2.379527e-02
## 17 SEX OFFENSE-OTHER day 1501 1.382225e-02
## 18 THEFT day 38687 3.562568e-01
## 19 TRESPASS day 4848 4.464376e-02
## 20 WEAPON day 735 6.768392e-03
Sometimes a simple contingency table reveals salient relationships without being plotted, but in this case a visual may be needed.
As before, let’s request a stacked barplot:
# bad idea
base2=ggplot(data=CrimeDayDF,
aes(x=daytime,y=share,fill=crime))
base2=base2 + geom_bar(stat = "identity", position = 'fill') +
geom_text(size = 3,
position = position_stack(vjust = 0.5),
aes(label=percent(share,accuracy = 0.1)))
barStacked2 = base2 + scale_y_continuous(labels = percent)
barStacked2
This plot would need a lot of work, so using it may not be a good strategy.
A first option would be to use a faceted barplot with dodged bars. Let’s make a first attempt.
# base with only X and Y in 'aes()'
baseBar = ggplot(CrimeDayDF, aes(x = crime, y = share ) )
#the bars
barPlot = baseBar + geom_bar( stat = "identity" )
barPlot
Now see the facets:
# bar per day time with 'facet'
barsFt = barPlot + facet_grid(~ daytime)
barsFt
This does not look like the crosstable yet; let’s solve that:
barsFt + coord_flip()
The type of crime is not ordinal, so we can reorder the bars:
# new base
baseRE = ggplot(CrimeDayDF,
aes(x = reorder(crime, share), #here
y = share ) ) + theme_minimal()
barPlotRE = baseRE + geom_bar( stat = "identity" )
barFtRE = barPlotRE + facet_grid( ~ daytime)
barFtRE= barFtRE + coord_flip()
barFtRE
Let’s work on the crime labels:
barFtRE=barFtRE + theme(axis.text.y = element_text(size=7,angle = 20))
barFtRE
Let’s annotate the bars:
barREann= barFtRE+ geom_text(aes(label=round(share,2)),
nudge_y = 0.1)
barREann
Let’s annotate conditionally instead:
barCond=barFtRE + geom_text(aes(label=ifelse(share>0.1,# condition to annotate
round(share,2),"")),
nudge_y = 0.1)
barCond
What about percents instead:
barFtRE + geom_text(aes(label=ifelse(share>0.1,
percent(share,accuracy = 1),# %
"")),
nudge_y = 0.1,size=3) +
scale_y_continuous(labels = percent_format(accuracy = 1,suffix="")) #%
Let’s keep using the same crime data. The next cases will show how a categorical variable can help us better understand the behavior of a numeric variable.
Let’s take a look at a variable that records the number of days it took someone to report a crime:
# stats of days to report
# notice the spread of values.
summary(crime$DaysToReport)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.00 0.00 7.65 1.00 36525.00 2
The max is a very high value. Let me see the crime that took the longest to report:
crime[which.max(crime$DaysToReport),]
## Report.Number Occurred.Date year month weekday Occurred.Time
## 6783 20080000465209 1908-12-13 1908 12 Sunday 2114
## Occurred.DayTime Reported.Date Reported.Time DaysToReport crimecat
## 6783 evening 2008-12-13 2114 36525 DUI
## Crime.Subcategory Primary.Offense.Description Precinct Sector Beat
## 6783 DUI DUI-LIQUOR EAST G G2
## Neighborhood
## 6783 CENTRAL AREA/SQUIRE PARK
Do you think this is right? This looks like a typo, since the Reported.Date is very similar, exactly 100 years later. Let’s alter the Occurred.Date value.
crime[6783,'Occurred.Date']=crime[6783,'Reported.Date']
We also need to recompute the DaysToReport value:
crime[6783,'DaysToReport']=difftime(crime[6783,'Reported.Date'],
crime[6783,'Occurred.Date'],
units = "days")%>%as.numeric()
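difftime() with units = "days" returns a difftime object; as.numeric() strips that class so the result fits in a plain numeric column. A quick check with two made-up dates:

```r
occurred=as.Date("2018-03-01")
reported=as.Date("2018-03-11")
# elapsed days as a plain number
as.numeric(difftime(reported,occurred,units = "days")) # 10
```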
The weekday and the year may also need to be updated:
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
crime[6783,'weekday']=wday(crime[6783,'Occurred.Date'],
label=TRUE,
abbr = FALSE)
crime[6783,'year']=year(crime[6783,'Occurred.Date'])
Let’s use the category Precinct again with the numerical DaysToReport. This time, let’s keep only the rows with no missing data:
crime_nona=crime[complete.cases(crime),]
Let me get the median for each precinct:
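aggregate() with a formula applies the function within each group of the right-hand-side variable; here are the mechanics on a small made-up data frame:

```r
toyDF=data.frame(Precinct=c("A","A","B","B","B"),
                 Days=c(0,10,1,3,5))
# median of Days within each Precinct: A -> 5, B -> 3
aggregate(data=toyDF,Days~Precinct,FUN=median)
```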
# summary: median by groups
aggregate(data=crime_nona, DaysToReport~Precinct,median)
## Precinct DaysToReport
## 1 EAST 0
## 2 NORTH 0
## 3 SOUTH 0
## 4 SOUTHWEST 0
## 5 WEST 0
As you can see, 50% of the cases are reported the same day. Let’s request a boxplot for each precinct:
# boxplot of days to report per precinct
library(ggplot2)
base=ggplot(data=crime_nona,
aes(x=Precinct,y=DaysToReport))
base + geom_boxplot()
The plot above does not give much insight; there is too much noise. Let’s check other statistics besides the median:
# using "summary" function
tapply(crime_nona$DaysToReport,
crime_nona$Precinct, summary)
## $EAST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 6.726 1.000 10622.000
##
## $NORTH
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 7.104 1.000 14268.000
##
## $SOUTH
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 7.22 1.00 11292.00
##
## $SOUTHWEST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 11.07 1.00 11992.00
##
## $WEST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 4.899 1.000 16801.000
From the information above, you know that for each precinct, 75% of crimes are reported within a day. If we consider that the expected behavior, let me keep the crimes that took longer, subset them at several thresholds, and combine the boxplots with ggarrange:
library(ggpubr)
baseWeek=ggplot(data=crime_nona[crime_nona$DaysToReport>=7,],
aes(x=Precinct,y=DaysToReport))
boxWeek=baseWeek + geom_boxplot() + labs(title = "week and above")
baseMonth=ggplot(data=crime_nona[crime_nona$DaysToReport>=30,],
aes(x=Precinct,y=DaysToReport))
boxMonth=baseMonth + geom_boxplot() + labs(title = "month and above")
baseYear=ggplot(data=crime_nona[crime_nona$DaysToReport>=365,],
aes(x=Precinct,y=DaysToReport))
boxYear=baseYear + geom_boxplot() + labs(title = "year and above")
#all in one:
ggarrange(boxWeek,boxMonth,boxYear,ncol = 1)
Up to this point, you should be planning a good story. The situation differs in each case, but let’s build our visual from the crimes that took a year or longer to report.
crimeYear=crime_nona[crime_nona$DaysToReport>=365,]
Let me see if flipping helps you read it better:
titleText="Crimes that took longer than one year to report"
baseYear=ggplot(data=crimeYear,
aes(x=Precinct,
y=DaysToReport))
boxYear=baseYear + geom_boxplot() +
labs(title = titleText)
# flipping
boxYear + coord_flip()
Since we are showing durations above a year, we might change the unit from days to years:
crimeYear$YearsToReport=crimeYear$DaysToReport/365
Let’s redo the previous boxplot, reordering the category by the median of the numeric variable:
baseYear=ggplot(data=crimeYear,
aes(x=reorder(Precinct,
YearsToReport,
median),
y=YearsToReport))
boxYear=baseYear + geom_boxplot() + labs(title =titleText)
# flipping
boxYear + coord_flip()
What if we use a histogram?
baseHY=ggplot(data=crimeYear,
aes(x=YearsToReport))
histHY=baseHY + geom_histogram(aes(fill=Precinct),
color='black') #color the border
histHY
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
You need facets:
histHY + facet_grid(~Precinct)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The alternative without legend:
histHY + facet_grid(Precinct~.) + guides(fill="none")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
What about reordering:
histHYre= histHY + facet_grid(reorder(Precinct,
-DaysToReport,
median)~.) + guides(fill="none")
histHYre
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Another common visual is the mean-error plot, which shows the mean of the numeric variable along with a confidence interval. Let me first recall the two variables I have been using:
crimeYear[,c('Precinct', 'YearsToReport')] %>%head(20)
## Precinct YearsToReport
## 1885 NORTH 1.000000
## 2122 SOUTHWEST 1.002740
## 2886 WEST 3.569863
## 2901 WEST 3.005479
## 2977 NORTH 2.704110
## 3029 SOUTH 1.301370
## 4317 NORTH 5.005479
## 4502 WEST 1.005479
## 5366 SOUTHWEST 1.991781
## 5848 EAST 1.000000
## 6053 EAST 1.024658
## 6084 EAST 4.380822
## 7019 NORTH 1.731507
## 7020 NORTH 2.583562
## 7914 EAST 1.002740
## 7995 WEST 46.030137
## 7996 NORTH 39.090411
## 8003 SOUTHWEST 31.668493
## 8005 WEST 31.221918
## 8007 SOUTHWEST 32.854795
The plan is to show the mean per precinct:
library(Rmisc)
## Loading required package: lattice
## Loading required package: plyr
##
## Attaching package: 'plyr'
## The following object is masked from 'package:ggpubr':
##
## mutate
Rmisc::group.CI(data=crimeYear,
YearsToReport~Precinct)
## Precinct YearsToReport.upper YearsToReport.mean YearsToReport.lower
## 1 EAST 3.387550 2.916302 2.445055
## 2 NORTH 3.533030 3.219861 2.906691
## 3 SOUTH 4.203968 3.717830 3.231692
## 4 SOUTHWEST 5.012880 4.335640 3.658401
## 5 WEST 3.075696 2.668819 2.261941
Let’s represent that:
baseMEANs=ggplot(crimeYear, aes(x=Precinct,
y=YearsToReport)) +
theme_classic()
pointMEANS=baseMEANs + stat_summary(fun = mean,
geom = "point")
pointMEANS
We can now add the error bar:
pointErrors=pointMEANS + stat_summary(fun.data = mean_ci,
geom = "errorbar")
pointErrors
Error bars have a big problem: they give you an illusion of symmetry. So, I recommend including the data in the plot:
BarJit=pointErrors + geom_jitter(colour="blue",
alpha=0.2 #transparency
)
BarJit
Some might prefer a logarithmic scale on the vertical axis:
BarJit + scale_y_log10(breaks=c(1,1.5,3,10,25,50)) + geom_hline(yintercept = 50,linetype='dashed')
Let me use one of the date variables in the crime data set (the one with no missing values):
summary(crime_nona$Occurred.Date)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "1964-06-15" "2010-10-29" "2013-10-18" "2013-08-10" "2016-05-11" "2018-11-06"
A date will be repeated if several crimes occurred on the same day. So we should prepare a frequency table of those dates:
crimeDate=as.data.frame(table(crime_nona$Occurred.Date)) # date will be a factor
head(crimeDate,10)
## Var1 Freq
## 1 1964-06-15 1
## 2 1973-01-01 1
## 3 1975-12-16 1
## 4 1978-01-01 1
## 5 1979-01-28 1
## 6 1979-07-04 1
## 7 1980-01-01 1
## 8 1981-02-14 1
## 9 1981-08-22 1
## 10 1985-09-13 1
The column with dates became a factor when we computed the frequency table. Let’s change the factor to Date, and also rename the columns:
names(crimeDate)=c("date",'count')
#formatting column in Freq Table:
crimeDate$date=as.Date(crimeDate$date)
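table() keeps its dimension names as character, which is why the dates came back as a factor; as.Date() on a factor converts via the character labels. A self-contained check with made-up dates:

```r
d=factor(c("2018-01-02","2018-01-01"))
as.Date(d)        # a proper Date vector, element order preserved
class(as.Date(d)) # "Date"
```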
So now you have:
head(crimeDate)
## date count
## 1 1964-06-15 1
## 2 1973-01-01 1
## 3 1975-12-16 1
## 4 1978-01-01 1
## 5 1979-01-28 1
## 6 1979-07-04 1
Let’s show the line plot:
base=ggplot(crimeDate,
aes(x=date,y=count))
base + geom_line(alpha=0.3)
Let’s zoom in on dates starting in 2010:
start <- as.Date("2010-1-1")
end <- NA
base=ggplot(crimeDate,
aes(x=date,y=count))
base + geom_line(alpha=0.3) + scale_x_date(limits = c(start, end))
## Warning: Removed 1174 row(s) containing missing values (geom_path).
Once we have daily counts we can make more use of lubridate, for example grouping the dates by month:
base=ggplot(crimeDate,
aes(x=floor_date(date, "month"),
y=count))
monthly= base + geom_line(alpha=0.3)
monthly= monthly + scale_x_date(limits = c(start, end))
monthly
## Warning: Removed 1174 row(s) containing missing values (geom_path).
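floor_date() only relabels each date with the first day of its month; if you want explicit monthly totals, the same bucketing can be done in base R with cut() (the dates and counts below are made up):

```r
dates=as.Date(c("2018-01-05","2018-01-20","2018-02-03"))
counts=c(2,3,4)
# bucket each date into the first day of its month
monthStart=as.Date(cut(dates,breaks = "month"))
# total counts per month: 2018-01 -> 5, 2018-02 -> 4
aggregate(counts,by=list(month=monthStart),FUN=sum)
```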
We could also add a trend:
# adding a trend:
monthlyTrend = monthly + stat_smooth(color = "red",
method = "loess")
monthlyTrend
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1174 rows containing non-finite values (stat_smooth).
## Warning: Removed 1174 row(s) containing missing values (geom_path).
What about faceting by crime? Our crimeDate data frame does not have that information, however, so let’s redo it:
crimeDate2=table(crime_nona$Occurred.Date,crime_nona$crimecat)%>%
as.data.frame() # date will be a factor
head(crimeDate2,10)
## Var1 Var2 Freq
## 1 1964-06-15 AGGRAVATED ASSAULT 0
## 2 1973-01-01 AGGRAVATED ASSAULT 0
## 3 1975-12-16 AGGRAVATED ASSAULT 0
## 4 1978-01-01 AGGRAVATED ASSAULT 0
## 5 1979-01-28 AGGRAVATED ASSAULT 0
## 6 1979-07-04 AGGRAVATED ASSAULT 0
## 7 1980-01-01 AGGRAVATED ASSAULT 0
## 8 1981-02-14 AGGRAVATED ASSAULT 0
## 9 1981-08-22 AGGRAVATED ASSAULT 0
## 10 1985-09-13 AGGRAVATED ASSAULT 0
The column of dates again appears as a factor; let’s reformat crimeDate2:
names(crimeDate2)=c("date","crime",'count')
#formatting column in Freq Table:
crimeDate2$date=as.Date(crimeDate2$date)
#then
head(crimeDate2,10)
## date crime count
## 1 1964-06-15 AGGRAVATED ASSAULT 0
## 2 1973-01-01 AGGRAVATED ASSAULT 0
## 3 1975-12-16 AGGRAVATED ASSAULT 0
## 4 1978-01-01 AGGRAVATED ASSAULT 0
## 5 1979-01-28 AGGRAVATED ASSAULT 0
## 6 1979-07-04 AGGRAVATED ASSAULT 0
## 7 1980-01-01 AGGRAVATED ASSAULT 0
## 8 1981-02-14 AGGRAVATED ASSAULT 0
## 9 1981-08-22 AGGRAVATED ASSAULT 0
## 10 1985-09-13 AGGRAVATED ASSAULT 0
Let’s keep the focus on 2010 onward:
base=ggplot(crimeDate2,
aes(x=floor_date(date, "month"),
y=count))
monthly= base + geom_line(alpha=0.3)
monthly= monthly + scale_x_date(limits = c(start, end))
# adding a trend:
monthly = monthly + stat_smooth(color = "red",
method = "loess")
monthly + facet_wrap(~crime)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 23480 rows containing non-finite values (stat_smooth).
## Warning: Removed 1174 row(s) containing missing values (geom_path).
Alternatively,
monthly + facet_wrap(~reorder(crime,-count))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 23480 rows containing non-finite values (stat_smooth).
## Warning: Removed 1174 row(s) containing missing values (geom_path).
We just reorganized the previous plot to highlight the most and least common crimes over that time period.
So far, lines have been used to report counts of crimes. We can also analyze the distribution of those counts using histograms. That is:
crime2010since=crime[crime$Occurred.Date>"2010-1-1",]
crimeTotalCountsDay=as.data.frame(table(crime2010since$Occurred.Date))
crimeTotalCountsDay$Var1=as.Date(crimeTotalCountsDay$Var1)
names(crimeTotalCountsDay)=c('date','counts')
ggplot(data=crimeTotalCountsDay, aes(x=counts)) + geom_histogram() + xlab("crime per day")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The plot above shows the distribution of crimes per day since 2010. Check this summary:
summary(crimeTotalCountsDay$counts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 43.0 114.0 127.0 127.1 140.0 199.0
This tells you there was a day with 199 crimes:
# checking the original data:
sort(table(crime2010since$Occurred.Date),decreasing = T)[1:3]
##
## 2017-07-01 2017-05-26 2016-01-20
## 199 192 186
From the summary you also know that since 2010 you can expect at least 43 crimes per day, and that the mean number of crimes per day is about 127. Let’s see a distribution per year:
tapply(crimeTotalCountsDay$counts,
year(crimeTotalCountsDay$date), FUN=summary)
## $`2010`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 59.0 107.0 119.0 118.6 129.0 171.0
##
## $`2011`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 58.0 100.0 113.0 113.1 124.0 169.0
##
## $`2012`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 43.0 102.2 112.0 112.0 122.0 164.0
##
## $`2013`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 82.0 112.0 124.0 124.8 137.0 174.0
##
## $`2014`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 77.0 122.0 136.0 135.1 147.0 181.0
##
## $`2015`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 89.0 119.0 130.0 130.6 142.0 184.0
##
## $`2016`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 82.0 124.2 134.0 134.4 145.0 186.0
##
## $`2017`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 82.0 127.0 138.0 137.8 148.0 199.0
##
## $`2018`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.0 129.2 139.0 139.1 148.8 183.0
If you need to plot this information, we must add a year column to the frequency table crimeTotalCountsDay:
crimeTotalCountsDay$year=year(crimeTotalCountsDay$date)
#you have
head(crimeTotalCountsDay,15)
## date counts year
## 1 2010-01-02 128 2010
## 2 2010-01-03 128 2010
## 3 2010-01-04 135 2010
## 4 2010-01-05 136 2010
## 5 2010-01-06 133 2010
## 6 2010-01-07 141 2010
## 7 2010-01-08 171 2010
## 8 2010-01-09 151 2010
## 9 2010-01-10 116 2010
## 10 2010-01-11 127 2010
## 11 2010-01-12 147 2010
## 12 2010-01-13 119 2010
## 13 2010-01-14 149 2010
## 14 2010-01-15 158 2010
## 15 2010-01-16 141 2010
Now, you can plot by year:
base = ggplot(crimeTotalCountsDay,
aes(x = counts))
HistSimple=base + geom_histogram(fill='grey', color=NA)
HistFacet=HistSimple+ facet_wrap(~year(crimeTotalCountsDay$date),
ncol = 1, #all in one column
strip.position = 'right')#,#year
HistFacet
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In general, it would look better if we used densities:
base = ggplot(crimeTotalCountsDay,
aes(x = counts)) + theme_classic()
densePlot=base + geom_density(fill='grey', color=NA)
denseFacet=densePlot+ facet_wrap(~year(crimeTotalCountsDay$date),
ncol = 1, #all in one column
strip.position = 'right')#,#year
denseFacet
You can improve this with:
densePlot +
# reduce space between density plot
theme(panel.spacing.y = unit(0.1, "lines"),
# no title on y
axis.title.y = element_blank(),
# no text on y
axis.text.y = element_blank(),
# no line on y
axis.line.y = element_blank(),
# no ticks on y
axis.ticks.y = element_blank(),
# the border and background of each year in facet:
strip.background = element_rect(colour="white"),
# the text of each year in facet
strip.text.y = element_text(size=12,
color="grey",
angle = 0))
We can also use plots similar to the ones from the previous material (cat-num). Let’s keep the durations longer than a year, from 2000 onward:
# new filtered data frame
crimeY2000=crime_nona[crime_nona$year>=2000 & crime_nona$DaysToReport>=365,]
# create new variable in YEARS:
crimeY2000$YearsToReport=crimeY2000$DaysToReport/365
#boxplot by Year
base=ggplot(data = crimeY2000,
aes(x=as.factor(year),
y=YearsToReport)) # is not ONE value
boxByYear=base + geom_boxplot()
boxByYear
Remember that although the boxplot is very informative, I recommend using familiar statistics:
# vector of colors named as stats:
myFunCols=c(Max='black',Median='purple',Min='blue')
# plotting
baseINV=ggplot(crimeY2000, aes(x=as.factor(year),
y=YearsToReport)) +
theme_classic()
# points with AESthetics:
MAXs=baseINV + stat_summary(fun = max, geom = "point",
aes(color='Max'))
MAXMINs=MAXs + stat_summary(fun = min, geom = "point",
aes(color='Min'))
MxMnMDs=MAXMINs+ stat_summary(fun = median, geom = "point",
aes(color='Median'))
MxMnMDs= MxMnMDs + scale_colour_manual(values=myFunCols,
name="Stats")
MxMnMDs
Notice that if we want to connect the points, this may not be what you have in mind:
MxMnMDs + geom_line()
Remember that YearsToReport is NOT a single value, but several. Connecting the dots therefore requires overriding the default grouping, so we add group=1:
MxMnMDsL=MxMnMDs+ geom_line(stat="summary",fun='max',
aes(color="Max"),
group=1) +
geom_line(stat="summary",fun='min',
aes(color="Min"),
group=1) +
geom_line(stat="summary",fun='median',
aes(color="Median"),
group=1)
MxMnMDsL + theme(axis.text.x = element_text(angle = 60,
vjust = 0.5
))
The study of bivariate relationships among numerical variables is known as correlation analysis. The data we have been using has few numerical columns, but we will produce two by aggregating the original data since 2015 by Neighborhood:
crime2015=crime_nona[crime_nona$year>=2015,]
# 1. MEAN of days it takes to report a crime by neighborhood
daysByNeigh=aggregate(data=crime2015,DaysToReport~Neighborhood,mean)
# you have:
head(daysByNeigh)
## Neighborhood DaysToReport
## 1 ALASKA JUNCTION 3.624267
## 2 ALKI 3.922434
## 3 BALLARD NORTH 4.248983
## 4 BALLARD SOUTH 3.844756
## 5 BELLTOWN 2.497066
## 6 BITTERLAKE 4.383847
# 2. Crimes by neighborhood
crimesByNeigh=as.data.frame(100*prop.table(table(crime2015$Neighborhood)))
names(crimesByNeigh)=c('Neighborhood', 'CrimeShare')
head(crimesByNeigh)
## Neighborhood CrimeShare
## 1 ALASKA JUNCTION 1.5336142
## 2 ALKI 0.4430089
## 3 BALLARD NORTH 2.2081719
## 4 BALLARD SOUTH 3.3167513
## 5 BELLTOWN 2.7928590
## 6 BITTERLAKE 1.8523903
Since both data frames have the same neighborhoods, we can merge them into one data frame:
num_num=merge(daysByNeigh,crimesByNeigh,by='Neighborhood')
str(num_num)
## 'data.frame': 58 obs. of 3 variables:
## $ Neighborhood: Factor w/ 58 levels "ALASKA JUNCTION",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ DaysToReport: num 3.62 3.92 4.25 3.84 2.5 ...
## $ CrimeShare : num 1.534 0.443 2.208 3.317 2.793 ...
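merge() matches rows on the by column regardless of row order, like a database join; a toy check with two made-up data frames:

```r
a=data.frame(Neighborhood=c("ALKI","BELLTOWN"),Days=c(3.9,2.5))
b=data.frame(Neighborhood=c("BELLTOWN","ALKI"),Share=c(2.8,0.4))
# one row per matched Neighborhood, with columns from both sides
merge(a,b,by='Neighborhood')
```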
Let’s turn the Neighborhood into characters:
num_num$Neighborhood=as.character(num_num$Neighborhood)
Once we have the data organized, the usual option is the scatterplot:
base = ggplot(num_num, aes(x=DaysToReport,y=CrimeShare))
plot1= base + geom_point()
plot1
If you compute the Pearson correlation coefficient, you may not find a relevant correlation:
cor.test(num_num$DaysToReport,num_num$CrimeShare,method = "pearson")
##
## Pearson's product-moment correlation
##
## data: num_num$DaysToReport and num_num$CrimeShare
## t = -1.8145, df = 56, p-value = 0.07496
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.46559934 0.02412146
## sample estimates:
## cor
## -0.2356425
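cor.test() returns an htest object, so the coefficient and p-value can also be extracted programmatically; a sketch with made-up, strongly correlated vectors:

```r
x=c(1,2,3,4,5)
y=c(2,4,6,8,11)
ct=cor.test(x,y,method = "pearson")
ct$estimate # the correlation coefficient
ct$p.value  # the significance of the test
```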
However, you can still find something relevant visually. Let’s use ggrepel to show labels:
library(ggrepel)
plot1 + geom_text_repel(aes(label=Neighborhood))
## Warning: ggrepel: 44 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
plot1 + geom_text_repel(aes(label=Neighborhood),size=2,
min.segment.length = 0,
max.overlaps = 100)
Now we can limit the labels, annotating only the neighborhoods that represent at least 5% of the crimes in the city:
plot1 + geom_text_repel(aes(label=ifelse(CrimeShare>=5,Neighborhood, "")),size=2,
min.segment.length = 0,
max.overlaps = 100)
Or the ones that take longer than a week to report:
plot1 + geom_text_repel(aes(label=ifelse(DaysToReport>7,Neighborhood, "")),
size=2,
min.segment.length = 0,
max.overlaps = 100)
Besides conditionally annotating the places, you can identify the area of the most salient behavior. Let’s highlight overlapping points:
library(hexbin)
heat = base + geom_hex(bins = 10) +
scale_fill_distiller(palette ="Greys",
direction=1)
heat
The palettes can be selected from the ColorBrewer website. Using the same approach as before, we can try a different plot (stat_density_2d):
heat2 = base + stat_density_2d(aes(fill = ..density..),
geom = "raster", contour = FALSE) + scale_fill_distiller(palette="Reds", direction=1)
heat2
Now you have an approximation of the region representing the most common behavior:
heat2 + theme_light() +
geom_text_repel(aes(label=Neighborhood),size=2,color='black',
fontface='bold',
min.segment.length = 0,
max.overlaps = 100) +
scale_x_continuous(expand = c(0, 0),limits = c(2,5)) +
scale_y_continuous(expand = c(0, 0),limits = c(0.5,3))
## Warning: Removed 27 rows containing non-finite values (stat_density2d).
## Warning: Removed 396 rows containing missing values (geom_raster).
## Warning: Removed 27 rows containing missing values (geom_text_repel).
You can use geom_label_repel in some cases (though in this case it might not work well).
heat2 + theme_light() +
geom_label_repel(aes(label=Neighborhood),size=2,color='black',
fontface='bold',
min.segment.length = 0,
max.overlaps = 100) +
scale_x_continuous(expand = c(0, 0),limits = c(2,5)) +
scale_y_continuous(expand = c(0, 0),limits = c(0.5,3))
## Warning: Removed 27 rows containing non-finite values (stat_density2d).
## Warning: Removed 396 rows containing missing values (geom_raster).
## Warning: Removed 27 rows containing missing values (geom_label_repel).
Sometimes increasing the label size helps:
heat2 + theme_light() +
geom_label_repel(aes(label=Neighborhood),size=3,color='black',
fontface='bold',
min.segment.length = 0,
max.overlaps = 100) +
scale_x_continuous(expand = c(0, 0),limits = c(2,5)) +
scale_y_continuous(expand = c(0, 0),limits = c(0.5,3))
## Warning: Removed 27 rows containing non-finite values (stat_density2d).
## Warning: Removed 396 rows containing missing values (geom_raster).
## Warning: Removed 27 rows containing missing values (geom_label_repel).
Some of these plots work better in interactive mode.